OpenTED Browser: Insights into European Public Spendings
We present the OpenTED browser, a Web application for interactively
browsing public spending data related to public procurements in the European
Union. The application relies on Open Data recently published by the European
Commission and the Publications Office of the European Union, from which we
imported a curated dataset of 4.2 million contract award notices spanning the
period 2006-2015. The application is designed to easily filter notices and
visualise relationships between public contracting authorities and private
contractors. The simple design makes it possible, for example, to quickly find
out who the biggest suppliers of local governments are, and the nature of the
contracted goods and services. We believe the tool, which we release as Open Source,
is a valuable resource for journalists, NGOs, analysts and
citizens seeking information on public procurement, from large-scale
trends to local municipal developments.
Comment: ECML, PKDD, SoGood workshop 201
Feature selection in high-dimensional dataset using MapReduce
This paper describes a distributed MapReduce implementation of the minimum
Redundancy Maximum Relevance algorithm, a popular feature selection method in
bioinformatics and network inference problems. The proposed approach handles
both tall/narrow and wide/short datasets. We further provide an open source
implementation based on Hadoop/Spark, and illustrate its scalability on
datasets involving millions of observations or features.
From dependency to causality: a machine learning approach
The relationship between statistical dependency and causality lies at the
heart of all statistical approaches to causal inference. Recent results in the
ChaLearn cause-effect pair challenge have shown that causal directionality can
be inferred with good accuracy also in Markov indistinguishable configurations
thanks to data driven approaches. This paper proposes a supervised machine
learning approach to infer the existence of a directed causal link between two
variables in multivariate settings. The approach relies on
the asymmetry of some conditional (in)dependence relations between the members
of the Markov blankets of two causally connected variables. Our results show
that supervised learning methods may be successfully used to extract causal
information on the basis of asymmetric statistical descriptors also for
multivariate distributions.
Comment: submitted to JML
minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information
Study of meta-analysis strategies for network inference using information-theoretic approaches
© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Reverse engineering of gene regulatory networks (GRNs) from gene expression data is a classical challenge in systems biology. Thanks to high-throughput technologies, a massive amount of gene-expression data has been accumulated in public repositories. Modelling GRNs from multiple experiments (also called integrative analysis) has, therefore, naturally become a standard procedure in modern computational biology. Indeed, such analysis is usually more robust than the traditional approaches focused on individual datasets, which typically suffer from some experimental bias and a small number of samples.
To date, there are mainly two strategies for the problem of interest: the first one (“data merging”) merges all datasets together and then infers a GRN, whereas the other (“networks ensemble”) infers GRNs from every dataset separately and then aggregates them using some ensemble rules (such as ranksum or weightsum). Unfortunately, a thorough comparison of these two approaches is lacking.
In this paper, we evaluate the performance of the meta-analysis approaches mentioned above with a systematic set of experiments based on in silico benchmarks. Furthermore, we present a new meta-analysis approach for inferring GRNs from multiple studies. Our proposed approach, adapted to methods based on pairwise measures such as correlation or mutual information, consists of two steps: aggregating the matrices of pairwise measures from every dataset, followed by extracting the network from the meta-matrix.
Peer Reviewed. Postprint (author's final draft
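The two-step idea described above (aggregate per-dataset pairwise matrices, then extract a network from the meta-matrix) can be illustrated with a minimal sketch. The abstract does not specify the aggregation rule or the extraction step, so the plain average, the absolute Pearson correlation, the fixed threshold, and the function names `pairwise_matrix` and `meta_network` are all assumptions made for illustration only, not the authors' method.

```python
import numpy as np

def pairwise_matrix(data):
    """Absolute Pearson correlation between columns (genes); rows are samples.

    Assumption: correlation stands in for any pairwise measure
    (the abstract also mentions mutual information).
    """
    m = np.abs(np.corrcoef(data, rowvar=False))
    np.fill_diagonal(m, 0.0)  # ignore self-interactions
    return m

def meta_network(datasets, threshold):
    """Step 1: average the per-dataset matrices into a meta-matrix.
    Step 2: keep the gene pairs whose aggregated score reaches the threshold.

    Both the averaging rule and the thresholding are hypothetical choices.
    """
    meta = np.mean([pairwise_matrix(d) for d in datasets], axis=0)
    adjacency = meta >= threshold
    return meta, adjacency
```

All datasets are assumed to measure the same genes in the same column order; any inference method that operates on a pairwise score matrix could replace the threshold in step 2.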
On the Impact of Entropy Estimation on Transcriptional Regulatory Network Inference Based on Mutual Information
Adversarial Learning in Real-World Fraud Detection: Challenges and Perspectives
The data economy relies on data-driven systems, and complex machine learning
applications are fueled by them. Unfortunately, machine learning
models are exposed to fraudulent activities and adversarial attacks, which
threaten their security and trustworthiness. In the last decade or so,
research interest in adversarial machine learning has grown significantly,
revealing how learning applications could be severely impacted by effective
attacks. Although early results in adversarial machine learning indicate the
huge potential of the approach in specific domains such as image processing,
there is still a gap in both the research literature and practice regarding how
to generalize adversarial techniques to other domains and applications. Fraud
detection is a critical defense mechanism for the data economy, as it is for other
applications, and it poses several challenges for machine learning. In
this work, we describe how attacks against fraud detection systems differ from
other applications of adversarial machine learning, and propose a number of
interesting directions to bridge this gap.
Information-Theoretic Inference of Large Transcriptional Regulatory Networks
The paper presents MRNET, an original method for inferring genetic networks from microarray data. The method is based on maximum relevance/minimum redundancy (MRMR), an effective information-theoretic technique for feature selection in supervised learning. The MRMR principle consists of selecting, among the least redundant variables, the ones that have the highest mutual information with the target. MRNET extends this feature selection principle to networks in order to infer gene-dependence relationships from microarray data. The paper assesses MRNET by benchmarking it against RELNET, CLR, and ARACNE, three state-of-the-art information-theoretic methods for large (up to several thousands of genes) network inference. Experimental results on thirty synthetically generated microarray datasets show that MRNET is competitive with these methods.
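The MRMR principle summarised above can be sketched as a greedy forward selection. This is a minimal illustration under stated assumptions, not the MRNET implementation: it uses a plug-in mutual information estimate on discrete data, scores each candidate as relevance minus mean redundancy (one common formulation of the criterion), and the names `mutual_information` and `mrmr_select` are hypothetical.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in (empirical) mutual information, in nats, of two discrete vectors."""
    x_vals, x_idx = np.unique(np.asarray(x).ravel(), return_inverse=True)
    y_vals, y_idx = np.unique(np.asarray(y).ravel(), return_inverse=True)
    joint = np.zeros((x_vals.size, y_vals.size))
    np.add.at(joint, (x_idx, y_idx), 1.0)   # contingency counts
    joint /= joint.sum()                    # joint probabilities
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def mrmr_select(X, y, k):
    """Greedily pick up to k columns of X, trading relevance against redundancy."""
    n_features = X.shape[1]
    relevance = np.array([mutual_information(X[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    while len(selected) < min(k, n_features):
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Mean mutual information with the features already chosen.
            redundancy = np.mean([mutual_information(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

In a network-inference setting, a selection of this kind would be repeated with each gene taken in turn as the target `y`; continuous expression values would first need to be discretised, or a different entropy estimator used.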
Impact of filter feature selection on classification: an empirical study
The high-dimensionality of Big Data poses challenges in data understanding and visualization. Furthermore, it leads to lengthy model building times in data analysis and poor generalization for machine learning models. Consequently, there is a need for feature selection, which allows identifying the more relevant part of the data to improve the data analysis (e.g., building simpler and more understandable models with reduced training time and improved model performance). This study aims to (i) characterize the factors (i.e., dataset characteristics) that influence the performance of feature selection methods, and (ii) assess the impact of feature selection on the training time and accuracy of binary and multiclass classification problems. As a result, we propose a systematic method to select representative datasets (i.e., considering the distributions of several dataset characteristics) in a given repository. Next, we provide an empirical study of the impact of eight feature selection methods on Naive Bayes (NB), Nearest Neighbor (KNN), Linear Discriminant Analysis (LDA), and Multilayer Perceptron (MLP) classification algorithms using 32 real-world datasets and a relative performance measure. We observed that feature selection is more effective in reducing training time (e.g., up to 60% for LDA classifiers) than improving classification accuracy (e.g., up to 5%). Furthermore, we observed that feature selection gives slight accuracy improvement for binary classification (i.e., up to 5%), while it mostly leads to accuracy degradation for multiclass classification. Although none of the studied feature selection methods is best in all cases, for multiclass classification, we observed that correlation based and minimum redundancy maximum relevance feature selection methods gave the best results in accuracy. 
Through statistical testing, we found LDA and MLP to benefit more in accuracy improvement after feature selection than KNN and NB.
The project leading to this publication has received funding from the European Commission under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 955895).
Peer Reviewed. Postprint (published version